Lecture 7: Full feedback and adversarial rewards (part II)

Author

  • Alex Slivkins
Abstract

Previously, we introduced the best-expert problem and proved an O(log K) mistake bound for the majority vote algorithm when a perfect expert exists, i.e., there is an expert that never makes a mistake. Now let us turn to the more realistic case where there is no perfect expert on the committee. We extend the majority vote algorithm with confidence weights: at each round, we maintain a weight w_i for each expert i and choose the prediction with the highest total weight. After observing the feedback, we decay the weights of incorrect experts by a factor of (1 − ε). This algorithm is called the Weighted Majority Algorithm (WMA).
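As a concrete illustration, here is a minimal sketch of WMA in Python for binary predictions. It is not taken from the lecture notes; the function name, array layout, and default decay parameter eps are assumptions for this example.

```python
import numpy as np

def weighted_majority(expert_preds, outcomes, eps=0.1):
    # expert_preds: (T, K) array of 0/1 predictions, one column per expert.
    # outcomes: length-T array of the true 0/1 labels.
    T, K = expert_preds.shape
    w = np.ones(K)                 # start fully confident in every expert
    algo_preds = []
    for t in range(T):
        # Predict the label backed by the larger total weight.
        weight_for_1 = w @ expert_preds[t]
        pred = 1 if weight_for_1 >= w.sum() / 2 else 0
        algo_preds.append(pred)
        # Decay the weight of every expert that predicted incorrectly.
        wrong = expert_preds[t] != outcomes[t]
        w[wrong] *= 1 - eps
    return algo_preds
```

With this multiplicative update, the standard analysis shows WMA makes at most about 2(1 + ε)·OPT + O((log K)/ε) mistakes, where OPT is the number of mistakes of the best expert.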


Similar resources

Lecture 7: Full feedback and adversarial rewards (part I)

A real-life example is the investment problem. Each morning, we choose a stock to invest in. At the end of the day, we observe not only the price of our chosen stock but the prices of all stocks. Based on this kind of “full” feedback, we determine which stock to invest in the next day. A motivating special case of “bandits with full feedback” can be framed as a question-answering problem with experts...


Lecture 2: Bandits with i.i.d. rewards (Part II)

So far we’ve discussed non-adaptive exploration strategies. Now let’s talk about adaptive exploration, in the sense that the bandit feedback from different arms in previous rounds is fully utilized. Let’s start with 2 arms. One fairly natural idea is to alternate them until we find that one arm is much better than the other, at which point we abandon the inferior one. But how to define “one arm is ...
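To make the idea concrete, here is a hedged sketch of such an adaptive strategy for two arms. The stopping rule, based on a confidence radius of sqrt(2·log(T)/t), is the standard high-confidence criterion; the pull_arm callback and the constants are assumptions of this example, not necessarily the lecture's exact algorithm.

```python
import math

def explore_then_eliminate(pull_arm, T):
    # pull_arm(i) returns a reward in [0, 1] for arm i in {0, 1}.
    sums = [0.0, 0.0]
    t = 0          # number of times each arm has been pulled so far
    best = None
    while 2 * (t + 1) <= T and best is None:
        for i in (0, 1):
            sums[i] += pull_arm(i)
        t += 1
        radius = math.sqrt(2 * math.log(T) / t)
        means = [s / t for s in sums]
        # Abandon the inferior arm once the gap exceeds twice the radius.
        if abs(means[0] - means[1]) > 2 * radius:
            best = 0 if means[0] > means[1] else 1
    if best is None:
        best = 0 if sums[0] >= sums[1] else 1
    # Commit to the apparent best arm for the remaining rounds.
    total = sum(sums)
    for _ in range(T - 2 * t):
        total += pull_arm(best)
    return total
```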


Lecture 6 + 7: Adversarial Bandits

So far, we have been talking about multi-armed bandits where the rewards are stochastic, generated independently and identically from a fixed unknown distribution for each arm. Today, we’ll look at a different setup: adversarial rewards. Instead of there being a distribution for each arm, we assume there is a hidden sequence of rewards r_{i,1}, ..., r_{i,T} for each arm i. We observe r_{i,t} if we pull arm i at ...
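A classical algorithm for this adversarial setting with bandit feedback is Exp3. The sketch below is a standard textbook version (rewards assumed in [0, 1]), not necessarily the variant covered in the linked lecture; pull_arm is a hypothetical callback standing in for the hidden reward sequences.

```python
import math
import random

def exp3(pull_arm, K, T, gamma=0.1):
    # pull_arm(i, t) returns the hidden reward r_{i,t} of the pulled arm only,
    # matching the bandit feedback model described above.
    weights = [1.0] * K
    total = 0.0
    for t in range(T):
        wsum = sum(weights)
        # Mix the weight-proportional distribution with uniform exploration.
        probs = [(1 - gamma) * w / wsum + gamma / K for w in weights]
        i = random.choices(range(K), weights=probs)[0]
        r = pull_arm(i, t)
        total += r
        # Importance-weighted estimate keeps the reward estimator unbiased.
        r_hat = r / probs[i]
        weights[i] *= math.exp(gamma * r_hat / K)
    return total
```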


The Effect of Communication Skills Training through Video Feedback Method on Interns' Clinical Competency

Introduction: Despite general agreement on its advantages, communication skills training raises methodological challenges. This study was performed to compare the effect of communication skills training through video feedback with the usual lecture method. Methods: This quasi-experimental, double-blind, prospective study was performed on two groups of 20 interns in the ye...


Deterministic MDPs with Adversarial Rewards and Bandit Feedback

We consider a Markov decision process with deterministic state transition dynamics, adversarially generated rewards that change arbitrarily from round to round, and a bandit feedback model in which the decision maker only observes the rewards it receives. In this setting, we present a novel and efficient online decision-making algorithm named MarcoPolo. Under mild assumptions on the structure o...




Publication date: 2016